Compound words in large-vocabulary German speech recognition systems
Authors
Abstract
This paper analyzes the impact of German compound words on speech recognition. It is well known that, due to an idiosyncrasy of German orthography, compound words make up a major fraction of German vocabulary. Most out-of-vocabulary (OOV) compounds are composed of frequent words already in the lexicon. This paper introduces a new method that handles the components of compounds rather than the compounds themselves. This not only reduces the vocabulary size, and therefore the perplexity, but also improves word accuracy. Reduced perplexity in turn means a more robust language model.
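The idea of recognizing components instead of whole compounds can be illustrated with a small sketch. The code below is not the method of the paper; it is a minimal dictionary-based splitter, assuming a toy lexicon and a hypothetical helper named `split_compound`, that breaks a word into the fewest components already present in the lexicon. Linking elements such as the German "s" between components are deliberately ignored.

```python
from functools import lru_cache


def split_compound(word, lexicon, min_len=3):
    """Split `word` into the fewest lexicon entries, or return None.

    Hypothetical helper for illustration only: `lexicon` is a set of
    lowercase component words, `min_len` is the shortest component allowed.
    """
    word = word.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Best split of word[start:], as a tuple of components.
        if start == len(word):
            return ()
        candidates = []
        for end in range(start + min_len, len(word) + 1):
            piece = word[start:end]
            if piece in lexicon:
                rest = best(end)
                if rest is not None:
                    candidates.append((piece,) + rest)
        # Prefer the split with the fewest components.
        return min(candidates, key=len) if candidates else None

    parts = best(0)
    return list(parts) if parts is not None else None


# Toy lexicon (an assumption, not data from the paper).
LEXICON = {"sprach", "erkennung", "system", "wort", "schatz"}

print(split_compound("Spracherkennung", LEXICON))
print(split_compound("Wortschatz", LEXICON))
print(split_compound("Haus", LEXICON))
```

With such a splitter, a compound missing from the recognizer's vocabulary can still be covered as long as its components are in the lexicon, which is the effect the abstract describes. A real system would also need to handle linking elements and ambiguous splits.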
Similar resources
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive is one of the key issues for this broadcasting...
Compound Word Recombination for German LVCSR
Compound words are a difficulty for German speech recognition systems since they cause high out-of-vocabulary and word error rates. State-of-the-art approaches augment the language model with the fragments of compounds in order to increase lexical coverage and lower the perplexity and out-of-vocabulary rate. The fragments are tagged in order to concatenate subsequent equally tagged fragments in the ...
Grapheme based speech recognition for large vocabularies
Common speech recognition systems use phonetically motivated subword units. To utilize words in these systems, one has to translate the available graphemic word representation into a phonetic one. To reduce this manual effort we propose to build grapheme based recognition systems. They can be used as speech interfaces for devices that can provide a graphemic representation of words like city na...
Generation of Adaptive Vocabulary Lexicon for Japanese LVCSR
One of the thorniest problems of large vocabulary continuous speech recognition systems is the large number of out-of-vocabulary (OOV) words. This is especially the case for languages like Japanese, which has many inflections, compound words and loanwords. The OOV words vary with the application domains. It's not realistic to have a big general-purpose lexicon including any possible OOV wor...
Adaptive vocabularies for transcribing multilingual broadcast news
One of the most prevalent problems of large-vocabulary speech recognition systems is the large number of out-of-vocabulary words. This is especially the case for automatically transcribing broadcast news in languages other than English that have a large number of inflections and compound words. We introduce a set of techniques to decrease the number of out-of-vocabulary words during recogniti...